CitiBike trip data from 2013 to 2016

Citi Bike is the largest bike sharing program in the United States. It opened in New York City in May 2013, with 6,000 bicycles and 330 docking stations in Manhattan and Brooklyn. Recently they released all the trip data, which includes anonymized trips from July 2013 through December 2016. There are more than 35 million trips in total - it should be fascinating to explore!

On the personal side, I signed up for its annual membership at its launch, and I really enjoyed the benefits. I must have contributed quite some trips in the data.

What are we trying to do here?

Now that we have 10 GB of potentially interesting data, it is very tempting to jump right into the rabbit hole and start exploring. Let me pause and ask myself the question: what are we trying to do here? Is there an imminent problem that needs to be solved?

Let me speculate - the questions might turn out to be meaningless and stupid, but at least they can keep me somewhat focused, and hopefully some of them can shed some light on, say, how efficient the system is, or how well we can predict its future behavior. By the end of this trip, hopefully we can come away with some simple take-home messages.

Questions and potential outcome of this project:

  • Build a model to predict bike use for next hour/day
  • Build a bike-transfer protocol, if the current one is not optimized
  • Given some initial conditions, can we predict the destination of a trip?
  • Given a trip, can we guess whether the rider is a subscriber or a customer?
  • For a specific station, can we predict its in / out bike flow for the next hour/day?
  • Is there any correlation between the travel patterns and the demographic of the city?

With these in mind, let's dive into the data.

Data Processing

Due to the large size of the data (~10 GB in total), it is very difficult to handle all of it directly with Python (pandas); therefore, I will focus on only part of the data, for example, the year 2014. I've also uploaded the data to AWS and run some simple SQL queries against the whole dataset with AWS Athena. The latter part is still an ongoing process.
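As a rough sketch, loading a year of monthly CSVs with pandas could look like the following (the helper name and file pattern are my own illustration, not the actual pipeline):

```python
import glob
import pandas as pd

def load_trips(paths, usecols=None):
    """Concatenate monthly CitiBike CSVs into one DataFrame.

    Reading month by month keeps peak memory manageable, and
    `usecols` lets us drop columns we do not need up front.
    """
    frames = [pd.read_csv(p, usecols=usecols) for p in paths]
    return pd.concat(frames, ignore_index=True)

# e.g. trips = load_trips(sorted(glob.glob('2014*-citibike-tripdata.csv')))
```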

The raw data is already nicely parsed, with data columns of:

  • Trip Duration (seconds)
  • Start Time and Date
  • Stop Time and Date
  • Start Station Name
  • End Station Name
  • Station ID
  • Station Lat/Long
  • Bike ID
  • User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
  • Gender (0 = unknown; 1 = male; 2 = female)
  • Year of Birth

From these, I will populate some derived columns as needed.

Data Overview

Let's first take an overall look at the data. There are 8,040,801 trips recorded during 2014, contributed by both annual members and day- / week-pass users. Among them, the annual members (usertype == 'Subscriber') made 90.408% of the trips.
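The subscriber share above is just the fraction of rows with a given usertype; a tiny helper (assuming the column name follows the raw data) makes it reusable for the gender breakdown as well:

```python
import pandas as pd

def share(series, value):
    """Percentage of entries in `series` equal to `value`."""
    return 100.0 * (series == value).mean()

# e.g. share(trips['usertype'], 'Subscriber')  # ~90.408 for the 2014 data
```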

The data also records the users' gender, and for subscribers, the birth years are also available. Let's take a quick look at those distributions. Among all the trips, 69.929% are taken by males and 20.461% by females; for the rest, the gender is not specified. In other words, there are ~3.5 times as many trips by male riders as by female riders.

We can also look at the age distributions of both male and female riders. It is clear that, in both cases, people in their 30s are the majority, while for females older than 40, the distribution falls off faster than for males.

I am using the bokeh package for the data visualization, due to its interactive features.

In [1]:
import bokeh
from IPython.core.display import display, HTML
display(HTML(filename='age_distribution_2014.html'))
Bokeh Plot

<a name = 'time'></a>

Exploration with respect to time

Next, let's see how long - on average - a trip takes. There are quite a few outliers in the data, for example, trips where the duration between start and end is on the order of days. Therefore, I will only look at trips within the 99.5th percentile of trip duration. Also, since I have gender information for the subscribers, I will break down the distribution by gender.
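The outlier cut is a one-liner in pandas; here is a small sketch, assuming the raw 'tripduration' column name:

```python
import pandas as pd

def clip_outliers(df, col='tripduration', pct=99.5):
    """Keep only trips at or below the given percentile of duration.

    A handful of trips last for days (bikes likely not docked properly),
    and they would dominate any duration histogram.
    """
    cutoff = df[col].quantile(pct / 100.0)
    return df[df[col] <= cutoff]
```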

In [2]:
display(HTML(filename='trip_duration_distribution_2014.html'))
Bokeh Plot

From the above plot, it seems that customers ride longer (distribution peaks at ~20 minutes) than subscribers (distribution peaks at ~6 minutes), which makes intuitive sense: customers (riders with a day/week pass) are more likely to be tourists, while subscribers are more likely to use Citi Bike for commuting. Among the subscribers, there is no significant difference between males and females.

Next let's take a look at how many trips are taken every day, by customers and subscribers.

In [3]:
display(HTML(filename='total_trip_by_day_2014.html'))
Bokeh Plot

Now there are many features!

First of all, the total number of trips taken by subscribers is larger than that by customers, but that's yesterday's news - we already knew the ratio is about 9:1.

Secondly, people ride more often during summer than during winter. That's expected as well. A better quantitative measure could be the correlation between the daily trip count and the weather (e.g., temperature, precipitation); I will fetch the daily weather data later in order to build such a relation. However, around Jan. 15, there is a spike for the subscribers. My first guess is that the weather was very warm during those days. From the data, this seems plausible - the average temperature during that week stayed close to or above the normal high; the dip on Jan. 14 (Tuesday) is due to the rain on that day. Furthermore, what makes the spike even more pronounced is the dip that follows it (the next week), and again I will attribute it to weather - the temperatures were extremely low during that week. The same weather argument can be applied to the dip around Feb. 15.

Third, on a finer scale, there are multiple dips in the subscriber curve and peaks in the customer curve, and they are nicely correlated. This is not totally unexpected though: the subscribers are more likely to use the bikes for commuting (during the weekdays), while the customers / tourists are more likely to use the bikes over the weekends. Of course, I would guess weather plays some part in this as well. To see this more clearly, I plot the daily trip count as a function of the day of the week, to drive this message home.
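The weekday breakdown boils down to two group-bys: count trips per calendar day, then average those daily counts within each day of the week. A sketch, assuming a raw 'starttime' column:

```python
import pandas as pd

def trips_by_weekday(df, time_col='starttime'):
    """Average daily trip count per day of week (0 = Monday .. 6 = Sunday)."""
    t = pd.to_datetime(df[time_col])
    # count trips per calendar day, carrying the weekday alongside
    daily = df.groupby([t.dt.date, t.dt.dayofweek]).size()
    # then average those daily counts within each weekday
    return daily.groupby(level=1).mean()
```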

In [4]:
display(HTML(filename='daily_trip_by_week_2014.html'))
Bokeh Plot

While I am at it, there is nothing stopping me from diving into an even finer time scale - a single day - and we arrive at the following result.

In [5]:
display(HTML(filename='daily_trip_by_hour_2014.html'))
Bokeh Plot

The two prominent peaks for subscribers in the weekday plot should really make the commuting message clear: during the weekdays, the majority of riders are using the bikes for commuting purposes. In other cases, people tend to use the bikes during mid-day.

<a name = 'location'></a>

Exploration with respect to location

Another important piece of information attached to each trip is the location of both the start station and the end station. There are many angles from which we can look at this information, and of course we can easily get lost. Let me first try to get a 'big picture' of the dynamics of the stations.

TODO: plot all the stations

There are 332 unique stations (in the 2014 data). I simply count the total trips starting and ending at each station, and then rank all the stations by the total number of in / out trips. Additionally, I can also record the total net change of bikes at each station. If the net change is close to zero, it implies that the station operates near balance. Of course, there might be bikes transferred in and out by the company, which is not directly revealed by this counting (but we will address transfers later).
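The counting itself is straightforward; here is a sketch of how the in / out / net tallies per station can be built (column names assumed to follow the raw data):

```python
import pandas as pd

def station_flows(df):
    """Tally outgoing, incoming, total and net trips per station.

    A net close to zero suggests the station roughly balances itself;
    a large net implies the operator has to truck bikes in or out.
    """
    out = df['start station name'].value_counts()
    inc = df['end station name'].value_counts()
    flows = pd.DataFrame({'out': out, 'in': inc}).fillna(0).astype(int)
    flows['total'] = flows['out'] + flows['in']
    flows['net'] = flows['in'] - flows['out']   # positive = bikes pile up
    return flows.sort_values('total', ascending=False)
```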

In [6]:
display(HTML(filename='busiest_stations_2014.html'))
Bokeh Plot

The number 1 station (8 Av & 31 St) sits right outside Penn Station, with a large net outflow of bikes - this is understandable, considering that Penn Station is a major train station that handles trains to/from New Jersey and Queens. The number 2 station (Lafayette St & E 8 St) sits close to the East Village, with a somewhat balanced in-out bike flow; there is no major transportation hub nearby, and the large bike traffic might be due to the activities in this area (pure speculation). The number 3 station (E 42 St & Vanderbilt Ave) sits outside another major transportation hub - Grand Central - also with a large net outflow of bikes. There are a few more stations with markedly net outgoing bikes, and they are all close to major traffic hubs (train, subway, PATH). There is one exception though: the W 33 St & 7 Ave station. It is also outside Penn Station, but it has a large net inflow of bikes. What does this mean? People tend to check out bikes from the west side of Penn Station, and return bikes to the east side?

TODO: take a close look at the three most busy stations

Next I will turn my eyes to the "actual trips", with the question: which routes (defined as start station --> end station) are the most popular? Riders can of course take many paths from point A to point B, but here I will simply ignore this variance. I will look at both (1) how many unique routes there are, and (2) what the top-20 most popular routes are.
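Counting routes is one group-by over the (start, end) station pair; a sketch, again assuming the raw column names:

```python
import pandas as pd

def top_routes(df, n=20):
    """Count trips per (start, end) station pair, most popular first."""
    routes = df.groupby(['start station name', 'end station name']).size()
    return routes.sort_values(ascending=False).head(n)
```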

In [7]:
display(HTML(filename='distinct_route_2014.html'))
Bokeh Plot
In [8]:
display(HTML(filename='popular_route_2014.html'))
Bokeh Plot

It is clear that, for customers, the routes are more concentrated than for subscribers. Furthermore, for customers, the routes are centered around Central Park, while for subscribers, the routes most likely start / end around transportation hubs.

Exploration with respect to both location and time

Now that I have quickly explored the trips with respect to time and location separately, let me take a combined view, for example: how fast - on average - does a rider bike? To do so, I need to know the distance of each trip; for the sake of simplicity, I will assume the rider takes the direct route suggested by Google Maps. It is a crude assumption, but it will get me started. There is still a problem with the roundtrips (i.e., trips starting from and ending at the same station). For now, I will set the distance of such trips to -1.

Next I need to know the distance between each pair of stations. For this purpose, I wrote a small script to query the Google Maps API for the distances between each pair of the 332 stations, and built a distance matrix (of size 332 x 332). There were ~50,000 unique queries, and it took me a while to complete the matrix due to the rate and query limits. Nevertheless, once I had the matrix, I could populate the speed distribution.
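With a distance (in miles) attached to each trip via the matrix lookup, speed is just distance over duration. A small helper that also masks the roundtrips flagged with -1 (the function name is my own):

```python
import numpy as np

def trip_speed_mph(dist_miles, duration_sec):
    """Average speed in mph; roundtrips (distance flagged as -1) become NaN."""
    dist = np.asarray(dist_miles, dtype=float)
    dur = np.asarray(duration_sec, dtype=float)
    speed = dist / (dur / 3600.0)   # miles per hour
    speed[dist < 0] = np.nan        # mask the roundtrips
    return speed
```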

In [9]:
display(HTML(filename='speed_distribution_2014.html'))
Bokeh Plot

It seems that subscribers usually ride faster than customers / tourists (not really surprising), and male riders ride faster than female riders (not surprising either). But the value here is that we have a somewhat quantitative measure of the difference: roughly speaking, subscribers ride 2 mph faster than customers.

Now that we have gotten hold of both space and time, sort of, there are millions of patterns we could try to spot and visualize. But let me curb my curiosity here, and try to answer some questions and make some predictions: after all, we need the data to do something for us.

What can we learn from the data?

So what can we learn from the data? Blessed by ignorance, let me ask a seemingly straightforward question: given a trip, can we predict whether it was taken by a customer or a subscriber? Can we build a model to make that prediction?

Of course we can build a model for this question; however, building a good model in this case is not trivial. The reason is that 90.408% of the trips were made by subscribers - in other words, if I simply guess that every trip was made by a subscriber, I will be right 90.408% of the time. This high benchmark leaves not much room for improvement. In fact, I've run some simple logistic regressions, with 'tripduration', 'hour', 'distance', 'speed', 'weekday' and station IDs as features, and the cross-validation score is just 91.23%: only 1% better than guessing. In the hope of improving the performance, I also downloaded the weather data and added the daily average temperature and precipitation as new features, but they didn't move the needle.
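For the record, the cross-validation setup is nothing fancy - a plain scikit-learn logistic regression scored with k-fold CV. The helper below is a simplified sketch, not the exact feature pipeline:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def usertype_cv_score(X, y, folds=5):
    """Mean cross-validated accuracy of a logistic regression.

    With ~90% of trips made by subscribers, any accuracy has to be
    read against that majority-class baseline.
    """
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=folds).mean()
```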
